1. INTRODUCTION

Humans communicate with each other mainly through words formed from languages. Information is
exchanged easily between two individuals who speak the same language. Computers and digital
systems work in a similar way. A computer can only understand instructions expressed in ones and zeros, so it
is the compiler’s job to translate a high-level programming language into these instructions. Automatic Speech
Recognition (ASR) systems aim to “translate” spoken human words in natural language into information usable
by computers and digital systems [1]. The speech signal varies greatly with context [2]. Even when the
same speaker says the same word repeatedly, the resulting speech signal varies, if only slightly.
Human communication is complemented by body language and by simpler versions of language that better suit
two-way dialogue [1]. In addition, unclear word boundaries, noise, regional and geographical
dialects, and speaker variability make building an accurate ASR system harder.

The ASR pre-processing stage greatly determines the outcome of the later stages. Framing, noise
removal, and segmentation are common pre-processing steps [2]. The focus of this
paper is on continuous audio segmentation. Segmentation algorithms can be categorized as follows [3]. The
first is metric-based segmentation, where audio streams are segmented at the maxima of the distances between
neighbouring windows placed at fixed sampling intervals. The second is decoder-guided segmentation, where
audio streams are decoded and then segmented at the silent points generated by the decoder. The third is
model-based segmentation, for example using Gaussian mixture models: segment boundaries are assumed at
locations where the acoustic class changes, and the incoming stream can be classified using maximum
likelihood selection.
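
As an illustration of the metric-based category, the following is a minimal Python sketch. It assumes a
simple distance (the absolute difference in mean energy between neighbouring, non-overlapping windows); the
function names, the window length parameter, and the choice of distance are illustrative assumptions and are
not taken from [3].

```python
def window_energies(signal, win_len):
    """Mean squared amplitude of consecutive, non-overlapping windows."""
    return [
        sum(x * x for x in signal[i:i + win_len]) / win_len
        for i in range(0, len(signal) - win_len + 1, win_len)
    ]


def metric_based_boundaries(signal, win_len):
    """Return sample indices where the distance between neighbouring
    windows is a local maximum (candidate segment boundaries)."""
    energies = window_energies(signal, win_len)
    # Assumed metric: absolute energy difference between window k and k + 1.
    dist = [abs(energies[k + 1] - energies[k]) for k in range(len(energies) - 1)]
    boundaries = []
    for k, d in enumerate(dist):
        left = dist[k - 1] if k > 0 else float("-inf")
        right = dist[k + 1] if k + 1 < len(dist) else float("-inf")
        if d > 0 and d >= left and d > right:
            # The boundary lies between window k and window k + 1.
            boundaries.append((k + 1) * win_len)
    return boundaries
```

A real system would typically replace the energy difference with a statistical distance between window
models, but the peak-picking step stays the same.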
Speech segmentation can be defined as the process of finding the boundaries (with specific characteristics)
between words, syllables, or phonemes in natural spoken language [4, 5]. The main objective of speech
segmentation is to serve other speech analysis tasks such as speech synthesis, training data preparation for
speech recognizers, or building and labelling prosodic databases. It can therefore be viewed as a vital
sub-problem for various fields in speech analysis and research [6, 7]. The traditional approach is manual
segmentation of speech, generally performed by specialized phoneticians. However, this method relies
on listening and visual judgment of the required boundaries, which makes it inconsistent and time
consuming [8, 9]. A more convenient alternative is automatic segmentation, in which the
speech is automatically segmented into acoustically defined sub-word units [10]. In ASR systems,
segmentation can be performed (i) at the system training stage, when
segmentation is applied to the training set recordings, and (ii) at the recognition stage [5].


2. SEGMENTATION TECHNIQUES 

Several well-established segmentation techniques have been proposed by previous researchers. In [10],
audio segmentation is performed using segment features. The proposed technique uses a log-linear
segment model to determine the segmentation of the input audio stream [11]. First, the audio data is
processed with a speaker-independent acoustic model [12]. The decoding process then hypothesizes the
locations of sentence starts and ends. The resulting segments are also clustered and used in constrained
maximum likelihood linear regression (CMLLR) feature transformations and maximum likelihood linear
regression (MLLR) mean transformations. The experimental results in [10] show that the framework is
applicable to various segment and boundary features and to different change point detection methods.

The Hidden Markov Model (HMM) is one of the most widely used segmentation techniques. A refined
HMM algorithm was tested for segmenting a Chinese corpus [11]. The method is carried out in three steps:

1. Obtain initial segmentation marks using HMM with forced alignment. 

2. Create a super vector for each boundary in the database from the acoustic vectors near the boundary. The
pseudo-triphones formed at the boundaries are classified using a classification and regression tree (CART),
which clusters them into a smaller number of classes. Each leaf node of the CART
is then used to train a Gaussian Mixture Model (GMM).

3. For each labelled sentence, refine the boundary of each segment. Starting from the HMM boundary
obtained above, compute for each nearby frame the likelihood that the frame contains the actual boundary.
The optimal boundary is taken to be the frame with the maximum likelihood under the GMM associated with
the CART leaf node for the pseudo-triphone.

Experimental results in [5] show that the refined HMM is more accurate than the standard
HMM segmentation.
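
The refinement in step 3 amounts to a local search around the initial HMM boundary. The Python sketch below
illustrates that idea only. It assumes the per-frame GMM log-likelihoods have already been computed; the
function name, the `frame_loglik` list, and the `search_radius` parameter are hypothetical and are not
specified in the cited work.

```python
def refine_boundary(initial_frame, frame_loglik, search_radius):
    """Pick the frame with the highest boundary log-likelihood near the HMM boundary.

    initial_frame -- boundary frame index from HMM forced alignment
    frame_loglik  -- frame_loglik[t] is the log-likelihood, under the GMM of the
                     matching CART leaf, that frame t contains the boundary
                     (assumed to be precomputed)
    search_radius -- number of frames searched on each side (assumed parameter)
    """
    lo = max(0, initial_frame - search_radius)
    hi = min(len(frame_loglik) - 1, initial_frame + search_radius)
    return max(range(lo, hi + 1), key=lambda t: frame_loglik[t])
```

For example, with scores `[-9.0, -7.5, -3.2, -1.1, -4.0, -8.0]`, an initial boundary at frame 2, and a
radius of 2, the search covers frames 0 to 4 and moves the boundary to frame 3.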

Brandt’s generalized likelihood ratio (GLR) method aims to detect discontinuities between homogeneous
segments of the speech signal, using statistics to sequentially detect abrupt changes in the parameters of
the signal model [11-14]. The signal $Y_n$ is described using an autoregressive model $M(A, \sigma)$ such that

$$ Y_n = \sum_{i=1}^{p} a_i Y_{n-i} + e_n, \qquad \mathrm{var}(e_n) = \sigma^2 \qquad (1) $$

where $p$ is the order of the autoregressive model and $e_n$ is a zero-mean noise with variance $\sigma^2$.
Assume the audio signal is windowed as in Figure 1.

Figure 1. Location of three windows in Brandt’s GLR

where W1 describes the signal $(Y_1, \ldots, Y_r)$ and W2 describes the signal $(Y_{r+1}, \ldots, Y_n)$. A jump
is detected when the GLR distance $D_n$ exceeds $D_0$, where $D_0$ is the predefined threshold.
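
For reference, the GLR distance is commonly written in the literature in the form below. This is a sketch of
the standard formulation and not necessarily the exact expression used in this work. It assumes a third
window W0 covering the whole signal $(Y_1, \ldots, Y_n)$, and $\hat{\sigma}_0$, $\hat{\sigma}_1$,
$\hat{\sigma}_2$ denote the estimated residual standard deviations of the autoregressive models fitted on W0,
W1, and W2 respectively.

```latex
D_n(r) = n \log \hat{\sigma}_0 - r \log \hat{\sigma}_1 - (n - r) \log \hat{\sigma}_2
```

Under this form, $D_n(r)$ is large when modelling W1 and W2 separately explains the signal much better than
a single model on W0, and a change at position $r$ is declared when $D_n(r) > D_0$.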

This research focuses on the use of Brandt’s GLR algorithm for segmenting Malay words. Malay
is one of many Austronesian languages, a family which also includes languages such as Pilipino and
Tagalog [15]. It is officially used in Malaysia and Singapore. Some aspects of the language are influenced by
English, and some words are borrowed directly from English. Building an ASR system for Malay is challenging
because of the various regional dialects and the occasional use of English words in sentences. Malay
syllables exist in the forms V, CV, CVC, CCV, CCVC, and CCCVC [16], where C denotes a consonant and V a
vowel. Syllable borders have two significantly different energy clusters which are visually noticeable;
however, when syllables are pronounced continuously and closely together, false abrupt changes might occur
depending on the speaker’s utterance style [17]. In read mode, the patterns are quite obvious and easily
detected, as only a minimal amount of noise is present, but in spontaneous mode, background noise, talking
pace, and interference from other speakers may make segmentation difficult [18].


3. METHODOLOGY 

The framework of this research is shown in Figure 2. Speech signal data in the form of Malay poems,
known as Pantun, is used to test Brandt’s GLR. The Pantun represents read-mode audio data, where the
recording is done in a controlled environment and the speaker controls his or her manner of speech to ensure
clear pronunciation and a fixed reading pace [19, 20]. The poems (Pantuns) are framed per sentence. No
windowing method is used because Brandt’s GLR works only on detecting energy differences in the time domain.
Further processes are explained in detail in the following sections.


[Figure 2: flowchart. Speech signal -> windowing and framing -> sentence extraction (manual) -> word-level
segmentation using Brandt’s GLR, alongside manual word-level segmentation -> evaluation -> result]

Figure 2. Research framework


3.1. Data collection 

As mentioned earlier, the test data are traditional Malay poems (Pantun)
read by a male speaker in a noise-free room. Instances of modern poems, as well as other types of poems, can
be found in all sorts of printed and electronic documents including books, newspapers, magazines, and websites
[21]. An instance of a modern Malay poem can represent a complete poem or a portion of one. Five (5) poems
(pantuns), each 10 sentences long, are first manually segmented. Both manual and automatic segmentation
are done on frames of one sentence each, cut from the Pantuns. Manual segmentation is
performed using the wavesurfer program, as shown in Figure 3.

Segmentation is done word by word, visually, by observing the waveform while
listening to the audio. Referring to Figure 3, the energy differences between words are visually obvious,
since the data is in read mode and there are clear silences between words.
Figure 3. Manual segmentation of the sentence “baik-baik jaga pedoman” using the wavesurfer
program. The segmented words are “baik”, “baik”, “jaga”, and “pedoman”


3.2. Brandt’s GLR algorithm 
Noise filtering is not required since little to no noise is present in the data. The data is first framed into
individual frames of one (1) sentence each [22-24]. Then the Brandt’s GLR ratio is calculated for each sample
as follows, using the windows in Figure 4.
1. The covariances of W1 and W2 are calculated by brute force. W1 starts at r = 1 and grows until n, and the
variance values for both W1 and W2 are calculated at each step.
2. The variation of $D_n$ is then calculated using equation (1), and the graph of $D_n$ is plotted.
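
The two steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions,
not the implementation used in this work: the distance uses the commonly cited form
n log(sigma0) - r log(sigma1) - (n - r) log(sigma2), the autoregressive models are fitted with the
autocorrelation method (Levinson-Durbin recursion), and the model order of 2 and the `margin` parameter are
illustrative choices. All function names are hypothetical.

```python
import math
import random


def ar_residual_variance(x, order=2):
    """Prediction-error variance of an AR(order) model fitted to x with the
    autocorrelation method (Levinson-Durbin recursion)."""
    n = len(x)
    # Biased autocorrelation estimates r[0..order].
    r = [sum(x[t] * x[t - k] for t in range(k, n)) / n for k in range(order + 1)]
    err = r[0]
    a = [0.0] * (order + 1)
    for i in range(1, order + 1):
        if err <= 0.0:
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return max(err, 1e-12)  # floor avoids log(0) on silent windows


def glr_distance(x, r, order=2):
    """GLR distance for a split after sample r (W1 = x[:r], W2 = x[r:])."""
    n = len(x)
    v0 = ar_residual_variance(x, order)
    v1 = ar_residual_variance(x[:r], order)
    v2 = ar_residual_variance(x[r:], order)
    # log(sigma) = 0.5 * log(variance)
    return 0.5 * (n * math.log(v0) - r * math.log(v1) - (n - r) * math.log(v2))


def best_split(x, order=2, margin=20):
    """Brute-force search: grow W1 from `margin` to n - margin samples and
    keep the split with the largest distance."""
    candidates = range(margin, len(x) - margin + 1)
    return max(candidates, key=lambda r: glr_distance(x, r, order))


# Usage on a synthetic signal: quiet noise followed by loud noise.
random.seed(0)
demo = [random.gauss(0, 0.1) for _ in range(200)] + \
       [random.gauss(0, 2.0) for _ in range(200)]
print(best_split(demo))  # expected to land close to sample 200
```

The brute-force search refits the models for every candidate split, which is simple but slow for long
signals; recursive estimators would avoid this at the cost of more code.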


[Figure 4: screenshot of the segmentation program. Values shown: threshold 200000; total time 2.793 s;
sample count 44; interval 6398 samples]

Figure 4. Top: signal waveform for the sentence “baik-baik jaga pedoman”. Bottom: variation of
$D_n$. The red lines show the points of segmentation


The segmentation points are then acquired by observing the highest ratio calculated by Brandt’s
GLR. The resulting segmentation points are compared to the reference points from the manual segmentation
that we performed using the wavesurfer program. The measurement
criterion is adapted from [25]. Let $K = \{K_1, K_2, \ldots, K_n\}$ and $R = \{R_1, R_2, \ldots, R_n\}$ be the
segmentation points in seconds obtained from Brandt’s GLR and from manual segmentation respectively. For each
$K_i$, the corresponding point $R_{k_i}$ is the point in $R$ whose time instance is closest to that of $K_i$.
Thus a sequence $R_K = \{R_{k_1}, R_{k_2}, \ldots, R_{k_n}\}$ is built to compare both segmentations.

An omission is detected when a point in $R_K$ is not in $K$, and an insertion when a point in $K$ is not in
$R_K$. The number of points common to $R_K$ and $K$ is the match count m, and the match rate is
(m/p) × 100, where p is the number of points in R [26, 27]. Accuracy is calculated as
accuracy = (m/(p + n)) × 100, with n as listed in Table 1, and is therefore affected by the number
of insertions. Brandt’s GLR method is evaluated in terms of the number of omissions, insertions, and
matches, and in terms of accuracy [28].
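
The scoring described above can be sketched in Python as follows. The sketch assumes that a pair counts as a
match only when the two points differ by no more than a time tolerance and that each reference point can be
matched at most once; both rules, and all function names, are illustrative assumptions. The `accuracy`
function simply evaluates m/(p + n) × 100 with the counts passed in.

```python
def match_points(auto_points, ref_points, tolerance):
    """Pair each automatic point with its closest reference point and count
    matches, omissions and insertions (times in seconds)."""
    matched_refs = set()
    insertions = 0
    for k in auto_points:
        # Index of the reference point closest in time to k.
        j = min(range(len(ref_points)), key=lambda i: abs(ref_points[i] - k))
        if abs(ref_points[j] - k) <= tolerance and j not in matched_refs:
            matched_refs.add(j)
        else:
            insertions += 1  # automatic point without a free reference partner
    omissions = len(ref_points) - len(matched_refs)
    return len(matched_refs), omissions, insertions


def match_rate(m, p):
    """Percentage of the p points that were matched."""
    return m / p * 100.0


def accuracy(m, p, n):
    """Accuracy in percent, penalised by the error count n."""
    return m / (p + n) * 100.0
```

With m = 5, p = 7, and n = 2, `accuracy` gives 5/9 × 100, which rounds to 55.56, the value in the first row
of Table 1 at 0 second tolerance.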
3.3. Short-term energy algorithm

The energy parameter has been used in speech segmentation since the 1970s [29]. This algorithm
was adopted and modified to better locate the beginning and ending points of speech for isolated spoken
Malay utterances, and is discussed in detail below. It is a two-step search algorithm in which the absolute
energy (AE) is first used for a coarse search [19]. The speech signal is first divided into 50% overlapping
frames of 10 ms and then passed through a rectangular window [30, 31]. The AE is computed by summing the
absolute magnitudes of the speech samples in each frame, as shown in (3).


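$$ E(m) = \sum_{n=m-N+1}^{m} \left| s(n)\, w(m-n) \right| \qquad (3) $$

where $w(m)$ is the rectangular window, $N$ is the frame length in samples for a frame ending at $n = m$, and
$s(n)$ are the speech samples, with 10 ms frames overlapping by 50%.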
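
As a minimal sketch of the coarse-search measure in (3), the Python function below sums the absolute sample
values over 10 ms frames with 50% overlap. With a rectangular window the window weights are all 1, so they
drop out of the sum. The function name and the `sample_rate` and `frame_ms` parameters are illustrative
assumptions.

```python
def absolute_energy(signal, sample_rate, frame_ms=10):
    """Absolute energy per frame: sum of |s(n)| over a rectangular window,
    with 50% overlap between consecutive frames."""
    frame_len = int(sample_rate * frame_ms / 1000)  # N samples per frame
    hop = max(1, frame_len // 2)                    # 50% overlap
    return [
        sum(abs(s) for s in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
```

For a signal sampled at 1000 Hz, each frame holds 10 samples and a new frame starts every 5 samples.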

The mean and standard deviation of the AE measure are first computed over the first 50 ms of the
speech, assuming there is only background noise in that interval [22]. This information is then used to
compute the peak energy (IMX) over the entire interval of each speech sample and the silence energy (IMN)
[23, 24]. Subsequently, IMX and IMN are used to set two energy thresholds, an upper threshold ($T_u$) and a
lower threshold ($T_l$), according to (4).


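$$ T_l = \mathrm{IMN} \left\{ 1 + 2 \log_{10} \left( \frac{\mathrm{IMX}}{\mathrm{IMN}} \right) \right\} \qquad (4) $$

The upper threshold ($T_u$) is computed as in (5) and (6).

$$ S_L = \frac{1}{WL} \sum_{i \in L} E(i) \qquad (5) $$

$$ T_u = T_l + 0.25\,(S_L - T_l) \qquad (6) $$

where $WL$ is the word length and $L$ is the index set of all frames having $E(i) > T_l$.
The upper level for the average energy is therefore set to 0.25, based on experimental findings in the case
of high noise [25].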


4. RESULTS AND ANALYSIS 

In the experiments, Brandt’s GLR is applied on periodic frames of 0.4 seconds. The algorithm was also
tested on frames of 0.2 seconds, but this was found to create a high number of insertions, lowering the
accuracy of the segmentation. At 0.2 seconds, Brandt’s GLR produces twice as many segmentation points as
there are reference segmentation points, as shown in Figure 5.

Each pantun is read in a controlled rhythm in which the words of each sentence are read
approximately 0.5 seconds apart. Therefore 0.4 second framing is relatively effective for this
type of segmentation. Pantun five (5) shows the worst accuracy, as many of its words carry prefixes, such
as dihati and membujang. Insertions occur between the prefix and the rest of the word, lowering the
accuracy. All of the other pantuns were segmented with 80% accuracy at a 0.2 second tolerance.

Nine of the 20 sentences were segmented with 100% accuracy; the overall results are presented in Table 1.
The 5th sentence of pantun two (2) achieved 100% segmentation within a 0.1 second time tolerance. In
that sentence, “suka hati kumbang yang terbang”, none of the words contains a prefix or suffix, and
no word has more than two syllables. Suffixes and prefixes can sometimes be captured as new words.
For example, in the 3rd sentence of pantun five (5), the prefix “membu” in “membujang” was captured as a
separate word. Frames of 0.4 seconds were chosen because they segment words that are two syllables long
without over-segmenting. This, however, causes over-segmentation in words of three or more syllables, which
is commonly due to the presence of prefixes or suffixes.
Figure 5. Average accuracy of each pantun vs time tolerance

Table 1. Overall segmentation results for five (5) pantun

0 second                             0.1 second                           0.2 second
     p        m       n  accuracy         p        m       n  accuracy         p        m       n  accuracy
(auto)  (match)  (miss)       (%)    (auto)  (match)  (miss)       (%)    (auto)  (match)  (miss)       (%)

     7        5       2     55.56         7        6       1     75.00         7        7       0    100.00
     8        2       6     14.29         8        4       4     33.33         8        5       3     45.45
     7        1       6      7.69         7        3       4     27.27         7        7       0    100.00
     7        2       5     16.67         7        6       1     75.00         7        6       1     75.00
     6        2       4     20.00         6        5       1     71.43         6        6       0    100.00
     8        3       5     23.08         8        5       3     45.45         8        6       2     60.00
     7        2       5     16.67         7        3       4     27.27         7        6       1     75.00
     7        2       5     16.67         7        7       0    100.00         7        7       0    100.00
     7        2       5     16.67         7        4       3     40.00         7        6       1     75.00
     7        4       3     40.00         7        5       2     55.56         7        7       0    100.00
     6        2       4     20.00         6        5       1     71.43         6        6       0    100.00
     7        3       4     27.27         7        4       3     40.00         7        6       1     75.00
     7        3       4     27.27         7        6       1     75.00         7        7       0    100.00
     7        2       5     16.67         7        5       2     55.56         7        7       0    100.00
     7        0       7      0.00         7        4       3     40.00         7        5       2     55.56
     7        3       4     27.27         7        6       1     75.00         7        7       0    100.00
     7        1       6      7.69         7        5       2     55.56         7        5       2     55.56
     9        2       7     12.50         9        3       6     20.00         9        7       2     63.64
     8        1       7      6.67         8        4       4     33.33         8        6       2     60.00
     5        1       4     11.11         5        3       3     37.50         5        5       1     83.33


Figure 6 shows the reference segmentation of the word “membujang” and Figure 7 shows the
segmentation produced by the algorithm. The word “membujang” was read over 0.6 seconds and was therefore
captured by two separate GLR frames, causing over-segmentation.
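
The arithmetic behind this can be checked with a short Python sketch. It assumes periodic, non-overlapping
frames that start at time zero and works in integer milliseconds to avoid rounding problems; the function
name and parameters are illustrative.

```python
def frames_spanned(start_ms, duration_ms, frame_ms):
    """Number of periodic analysis frames that a word overlaps."""
    first = start_ms // frame_ms
    last = (start_ms + duration_ms - 1) // frame_ms
    return last - first + 1
```

A 600 ms word always overlaps at least two 400 ms frames, and can overlap three depending on where it starts,
whereas a word of 400 ms or less can fit inside a single frame.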
Figure 6. Waveform for the sentence “sudah lama hidup membujang”. The red line shows the start of the
word “membujang” and the blue line marks the end of the prefix “mem”

Figure 7. Automatic segmentation of the sentence “sudah lama hidup membujang”.
Red lines show the segmentation points


5. CONCLUSION 

Four out of five of the pantuns were segmented with 80% accuracy. However, it should be noted
that all the data is in read mode and was recited in a controlled rhythm, which makes the segmentation
process much simpler than it would be for spontaneous speech, where multiple speakers talk at
different paces. Salam recommended in [7] using a higher-order autoregressive model to deliberately cause
over-segmentation and then removing the insertions using a neural network. This might also help with the
segmentation of spontaneous data. Testing this is our future goal.